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Abstract 



A nonparametric and locally adaptive Bayesian estimator is proposed for esti- 

I ^1 mating a binary regression. Flexibility is obtained by modeling the binary regres- 

1— I sion as a mixture of probit regressions with the argument of each probit regression 
> 

l/^ having a thin plate spline prior with its own smoothing parameter and with the 

lO mixture weights depending on the covariates. The estimator is compared to a sin- 
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gle spline estimator and to a recently proposed locally adaptive estimator. The 
methodology is illustrated by applying it to both simulated and real examples. 
KEY WORDS: Bayesian analysis; Markov chain Monte Carlo; Mixture-of-Experts; 
Model averaging; Surface Estimation; Reversible jump; 

1 Introduction 



Suppose we wish to model the spatial distribution of the habitat of the crested lark. One 
way to do this is to model the probability of a crested lark sighting as a function of latitude 
{lat) and longitude {Ion) as 

Pr(crested lark sighting|^at, Ion) = H {g{lat, Ion)} , (1.1) 
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where H is a link function, such as a probit or logit, and g is a function of latitude and 
longitude, which is either parametric or nonparametric. Our article presents a Bayesian 



method for estimating the probability in (1.1) that does not assume a parametric form for 



H and allows the probability to be locally adaptive with respect to the covariates, that is, to 
be smooth in one region of the covariate space and wiggly or even discontinuous in another. 

We model the binary regression as a mixture of probit binary regressions 

r 

Pr(crested lark sighting|Lat, Lon) = TTj{Lat, Lon)^ {gj{Lat, Lon)} , (1-2) 

i=i 

where $ is the standard normal cumulative distribution function and the gj are truncated 
spline functions. We demonstrate that the resulting estimators are nonparametric and locally 
adaptive, but do not overfit. The reasons for this good performance are that mixing is done 
outside the probit cumulative distribution function rather than inside, the weights ttj are 
allowed to vary with the covariates, and that the component functions gj can have different 
level of smoothness by having different smoothing parameters. This is especially important 
when modeling surfaces in two or more dimensions where a single smoothing parameter for a 
multidimensional surface will often be inadequate. The use of truncated spline bases allows 
models with several thousand observations and regression surfaces with a moderate number 
(at least 6 or 7) of covariates (provided the number of observations is adequate). Extending 
the methodology to higher dimensions problems is important because as the number of 
covariates increase so too does the need for local smoothing. This is because the smoothness 
of the function H{x) is likely to be different for different covariates. Our article allows the 
number of components to vary from r = 1, . . . , R, with R typically 3 or 4. We use a Bayesian 
approach and construct a Markov chain Monte Carlo sampling scheme (MCMC) to estimate 
the model that uses the reversible jump method of Green (1995) to move between model 
spaces having different numbers of components. 



Model (1.2) is known as a Mixture-of-Experts (ME) model and was first introduced by 
Jacobs et al. (1991) and Jordan and Jacobs (1994), who used simple linear functions for the 
gj and estimated the model by the EM algorithm. 
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There is an extensive literature on estimating binary regressions. McCullagh and Nelder (1989) 
discuss parametric approaches. Nonparametric binary regression is discussed by Wang (1994, 
1997), Wahba, Wang, Gu, Klein and Klein (1997) and Loader (1999). Wood and Kohn (1998) 
and Holmes and Mallick (2003) present Bayesian approaches to nonparametric binary re- 
gression. However, none of these papers show that their estimates are locally adaptive. 
Kribovokova et al. (2006) present an estimator of a binary regression that is based on quasi 
likelihood and show that it is locally adaptive. 

A number of locally adaptive estimators have recently been proposed for Gaussian re- 
gression models. Most of these estimators represent the unknown regression function as a 
linear combination of basis functions. Frequentist approaches such as Friedman and Silver- 
man (1989), Friedman (1991) and Luo and Wahba (1997) sought an optimal combination of 
basis functions using a greedy search algorithm, whereas Bayesian approaches such as Smith 
and Kohn (1996) and Denison, Mallick and Smith (1998) averaged over a large number of 
combinations of subsets of the basis functions. 

Wood, Jiang and Tanner (2002) proposed a locally adaptive estimator for Gaussian re- 
gression by mixing over a combination of splines and used BIG to choose the number of 
components. Our article builds on this Mixture of Experts approach by Wood,et al. (2002). 

A direct way to obtain a locally adaptive estimator of binary probabilities is to adaptively 



estimate g in (1.1) by using latent variables together with the basis selection methods in 
Friedman (1991), Denison, Mallick and Smith (1998), and Smith and Kohn (1996) or with 
the mixture of splines method in Wood et al. (2002). However, we have found that for the 
mixture of splines approach, adaptively estimating the regression function g by mixing on 
the inside of ^ results in poor estimates of the probabilities due to overfitting. Figure [T] gives 
an example of such overfitting. The figure shows the true probability H{x) = Pr{y = l\x) 
and the estimate H{x) = ^(7j{x)) where g{x) = 7r(x)/i(x) -|- (1 — 7r(a;))/2(x), which we call 
mixing on the inside. The figure also shows the estimate of H{x) based on modeling H{x) as 
7r(x)<^{/i(j;)} -|- (1 — 7r(x))<^{/2(x)}, which we call mixing on the outside. In this example, 
which is typical of all such examples, it is clear that mixing on the inside does not perform 
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as well as mixing on the outside. The technical report by Wood, Kohn, Jiang and Tanner 
(2005) provides details of why mixing on the inside tends to result in overfitting and produces 
inferior estimates to mixing on the outside. 



2 Model and Prior Specification 

We present the model in this section. Appendix A gives details of the sampling scheme used 
to estimate the model. Using the results in Tierney (1994) and Green (1995) we can show 
that the sampling scheme converges to the posterior distribution. 

Let -u; be a binary response variable taking the values and 1. We model the binary 

regression of w on x by a mixture of finite but unknown number of probit regressions, as 

R r 
Ft{w = l\x) = Hr{x) Pr(r) , Hr{x) = ^ 7r,>(x)${5,>(a;)} (2.1) 

r=l j=l 

r 

with TTjrix) = exp{djrz)/ ^ exp{dk^z) ,j = l,...,r, 

k=l 

where z = (l,x')'. We usually take the number of components i? as 3 or 4 with Pr(r) = 1/R, 
for r = 1,...,R. Without loss of generality we assume that the vector Sir = and let 
Sr = {52rj ■ ■ ■ ■> Srr) be a vector of unconstrained coefficients. We observe wi, . . . ,Wn as well 
as the corresponding covariates xi, . . . ,Xn- 

To place a prior on gjr we write 

gjr{x) = a'jrZ + fjr{x) (2.2) 

where ajr is a coefficient vector and fjr{x) is the nonlinear part of gjr- For j = 1, . . . ,r, 
let ijr = ^fjr{xi), . . . , fjrixn)^ ■ We Write fjr as a linear combination of basis functions as 
outlined below so that fj> = XPjr, where the columns of the design matrix X are partial thin 
plate spline basis functions and /3j> is a vector of coefficients. Appendix B describes how we 
construct the design matrix X to handle a large number of observations n and a moderate 
number of covariates. 
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The prior for ajr for j = 1, . . . , r is N{0,CaI), for some large Cq, and we assume that 
the QjrS are independent apriori. Based on emprical evidence we found that the regression 
function estimates were insensitive to the choice of Cq over the range [10^,10"'^^]. The prior 
for (3jr ~ N{0,TjrI),j = 1, . . . ,r. We assume a uniform prior for Tir ~ U{0,Cr), for some 
large Cr ■ To ensure identifiability we assume apriori that Tj> ~ U (0, T(j_x)r) for j = 2, . . . , r, 
i.e., Tir <,.•■,< < Cr- The prior on the parameter vector 6r for the mixing probabilities 
is A^(0, csl)), where cs = n. 



3 Simulations 

3.1 Comparison with a single component estimator 

The performance of the proposed method is studied for four functions listed in table [T| using 
a sample size of n = 1000 for each function. The three univariate functions are plotted in 
Figure [sj The bivariate function is plotted in Figure |4][|a) . The four functions are chosen in 
such a way that each requires a different type of smoothing. Function (a) requires only one 
smoothing parameter, function (b) requires local smoothing, function (c) is a discontinuous 
function, and function (d) is a discontinuous bivariate function. The estimates obtained 



Function Label 


Function Formula 


a (sin) 
b (peak) 
(step) 

d (cylinder) 


Pr(ii; = l\x) = <I>{2sin(47rx)} 

Pr(^. = l|x) = c^iiexp [(^] + 1 exp [^^] - 1} 
Pr(ii; = l|x) = ^>{-1.036 + 2.073^(0.25) - 1.427124(0.75)} 
where Ix{o) = 1 if x > a 

Pr(u; = l|x) = D{{x - 0.5)^ + (y - 0.5)^ - 0.16^} 
where D{x) = 0.8 if x < and D{x) = 0.2 if x > 



Table 1: Regression functions used in simulations 
using the proposed method are compared to estimates obtained using a single component 
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estimator as in Wood and Kohn (1998). We use this estimator for comparison because 
Wood and Kohn (1998) show that their single spline estimator outperforms other available 
estimators, such as GRKPACK by Wang (1997). Fifty replications were generated for each 
regression function using a maximum R = 3 components. We use the average symmetric 
Kullback-Leibler distance to measure performance. This measure is also used by Gu (1992) 
and Wang (1994), and is defined below. Let 

I{H{xi),H{xi)} = Piiwi = O xi) log <^ ^ \ + Fi{wi = I xi) log <^ ^ \ . 

[ Pr{wi = 0\xi) J 1^ Pr{wi = l|xi) J 

By Rao (1973, pp. 58-59), I{H{xi), H{xi)} < and I{H{xi), H{xi)} = if and only if 
H{xi) = H{xi) at the design points. The average symmetric Kullback-Leibler distance 
(ASKLD) between H and H is defined as 

1 " 

ASKLD{H, H) = -y \l{H{xi),Hixi)} + I{H{x,),H{xi)] . 

n ^-^ L J 

1=1 

It follows from the properties of the Kullback-Leibler distance that ASKLD{H, H) < and 
ASKLD{H, H) = 0'd and only \iH = H. Thus, the closer I{H, H) is to zero the better. 

For each replication, the ASKLD was calculated for both estimators, where H{xi) is the 
estimate obtained for the function Pr{wi = l|xj). Figure [2] compares the performance of the 
mixture estimator with the performance of the single component estimator using boxplots, 
with each boxplot representing the average difference in ASKLD for the fifty replications for 
that function. A positive difference means that the mixture model performs better than the 
single spline estimator for that replication, while a positive difference means the opposite. 

Figure [2] shows that the mixture of splines estimator performs significantly better in terms 
of the ASKLD than the single spline estimator when the regression surface is heterogenous, 
that is when it requires local smoothing. In particular, for the heterogeneous functions 
(b), (c) and (d) the mixture estimator outperformed the single spline. Furthermore, when 
the function is homogenous, that is when it does not require local smoothing, the mixture 
estimator performs almost as well as a single spline estimator because when only one spline 
is needed, for example for function (a), the posterior probability of a single spline is high. 
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Table [2] gives the average (over the 50 repUcations) posterior probabihty of the number of 
spHnes needed for mixing for each of the fom' regression functions. The table shows that for 
the homogenous function (a) the average posterior probability that only one spline is need is 
0.78. In general we have found that for homogenous functions the posterior probability of a 
single component is high. Conversely, we have found that for heterogenous functions the pos- 
terior probability of requiring more than a single component is high. Thus, for function (b), 
the average posterior probability is 0.76 that two splines are needed and 0.24 that three 
spline are needed. For the piecewise constant functions (c) and (d), the average posterior 
probabilities that two splines are needed are 0.60 and 0.84 respectively, whereas the average 
posterior probabilities that three splines are need is 0.37 and 0.08 respectively. 



Function Label 


Pr(r lu) 




1 


2 


3 


a (sin) 


0.78 


0.21 


0.01 


b (peak) 


0.00 


0.76 


0.24 


c (step) 


0.03 


0.60 


0.37 


d (cylinder) 


0.08 


0.84 


0.08 



Table 2: Average posterior probability of number of splines needed 

To see how the difference in ASKLD translates into differences in the regression function 
estimates, the estimates for functions (a)-(c) corresponding to the 10th percentile, 50th per- 
centile and 90th percentile ASKLD are plotted in Figure |3j Figure [3] shows that an estimator 
based on a mixture of splines can accurately estimate both homogeneous and heterogeneous 
functions. When the function is homogeneous, e.g. function (a), the mixture estimates are 
visually indistinguishable from the single spline estimates. However, when the function is 
heterogeneous, the single spline estimates are much worse than the mixture estimates. For 
example, even the 90^^ percentile of the single spline estimate of function (b) fails to cap- 
ture the peak on the left and overfits on the right, whereas the mixture estimate does well 
throughout the whole range of the function. Figure |4] plots the fit for the cylinder data for 
a single spline and a mixture of splines. The improvement is marked, with the single spline 
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being unable to capture the sudden change of curvature. 



3.2 Comparison to a locally adaptive estimator 

Kribovokova et al. (2006) propose a general approach for locally adaptive smoothing in 
regression models with the regression function modeled as a penalized spline that has a 
smoothly varying smoothing parameter function which is also modeled as a penalized spline. 
In particular, their approach handles local smoothing of binary data. The authors have 
implemented their approach in a package called Adaptfit which is written in R and is available 
at http: / / cran.r-project.org/ src/contrib/Descriptions/ AdaptFit.html This section compares 



our approach to that of Kribovokova et al. (2006) as implemented in Adaptfit in terms average 
squared error, coverage probabilities and running time for the four functions (a)-(d). The 
results reported below are each based on 50 replications unless stated otherwise. 

Let ASJ^ME be the average squared error over the abcissae of the data of the Bayesian 
ME estimator and let ASE^i;' be the corresponding averaged squared error of the Adaptfit 
estimator. We define the percentage change in going from the ME estimator to the adaptfit 
estimator as 

%AASE = ASEaf - ASEme ^ 
ASEme 

Table [s] reports the 25th, 50th, 75th percentiles and the mean of %AASE and shows that the 
ME estimator and the Adaptfit estimator perform similarly for functions (a) and (b), but 
that the ME of experts estimator outperforms the Adaptfit estimator for functions (c) and 
(d). 



Let the empirical coverage probability ^CY^ me{x) be the proportion (out of 50) of yu7o 
pointwise confidence intervals that contain the true probability at the abcissa x for the ME 
estimator. Let ECP^i;'(x) be defined similarly for the Adaptfit estimator. Figure [s] plots 
'ECV ME{xi) and YjC^ AF{xi) at the abcissae Xi of the data for the functions (a)-(c). Figure [6] 
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is a similar plot for the cylinder function (d). Let 

%AAECPme = 100 X (n-^J2^'^^ME{xi) -0.9]/ 0.9 

^ 1=1 ' 

be the percentage deviation from 0.9 of the average of the YjQV uEi^i) over all the Xj abcissae 

of the data. Let %AAECPyii? be defined similarly for the Adaptfit estimator. Table [s] 

reports both %AAECPm_e and %AAECPaf- Figure|5]and t able |3] suggest that the empirical 

coverage probabilities of the ME estimator and Adaptfit are similar for functions (a) and (b), 

while the ME estimator has superior empirical coverage probabilities for functions (c) and 

(d). 



Function 




%AASE 




%AAECP 




25 


50 


75 


mean 


ME 


AF 


(a) Sin 


-11.32 


7.86 


29.64 


11.18 


-0.74 


3.10 


(b) Peaks 


-8.03 


-0.703 


15.07 


7.72 


-3.34 


-5.89 


(c) Step 


52.23 


93.15 


140.23 


101.99 


-9.07 


-18.12 


(d) Cylinder 


86.28 


130.76 


162.11 


126.83 


-16.80 


-26.62 



Table 3: Comparison of Bayesian ME and adaptfit. The first four columns give the 25th, 
50th , 75th percentiles and the mean of the percentage difference between the averaged mean 
squared error of adaptfit and the ME estimators. The next two columns gives the percentage 
coverage errors of the 90% confidence intervals for ME and Adaptfit. 

We now discuss some computational issues that determine the performance of Adaptfit 
in the binary regression case. Adaptfit uses two sets of knots. The first set of knots is for 
the penalized spline basis for the regression function and the second set of knots is for the 
penalized spline basis for the smoothing parameters. See Kribovokova et al. (2006) for details. 
Let K\j be the number of knots chosen for the first penalized spline and the number of 
knots chosen for the second penalized spline. We found in the binary case that if a function 
requires a locally adaptive estimator then to get a satisfactory fit it may be necessary to 
take K\j quite large. For example, for the peak function (b) and the cylinder function (d) it 
seems necessary to take K^, = 120 to 150. The results for functions (a), (c) and (d) in table |3] 
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were obtained using using Ki, = 150 and Kc = 20. The results for function (b) (peak) were 
supplied to us by Dr Tatyana Krivobokova who used an Splus implementation of Adaptfit 
because we were unsuccessful with the R implementation . The results reported for Adaptfit 
for function (b) arc for 47 replicates with the other three replicates being unsatisfactory. 
Adaptfit also has a default choice of and Kc and we checked that the above settings 
for functions (a), (c) and (d) gave as good or better performance as the defaults. The times 
given below are averages over several replications and were obtained on a 2.8GHz PC running 
Matlab 7. For function (a), Adaptfit took 25 seconds for the default number of knots and 
320 seconds for {Kt„Kc) = (150,20). For function (c), Adaptfit took 36 seconds for the 
default number of knots and 325 seconds for {Kb, Kc) = (150, 20). For function (d), Adaptfit 
did not give satisfactory results for the default number of knots and took 316 seconds for 
{Kiy,Kc) = (150,20). The ME estimator takes about 90 seconds per 2000 iterations for 
each of the examples in this section. For each of the examples reported in this section and 
the next section we ran the ME estimator for 10000 iterations in total (5000 warmup and 
5000 sampling), that is, the time taken for each of the examples in this section is about 
450 seconds. However, we have found in extensive testing that it is sufficient to use 4000 to 
6000 iterations in all of these examples, that is about 180 to 270 seconds in total. All the 
times reported were recorded on a 2.8GHz PC running Matlab 7. Thus Adaptfit can be very 
fast when the default number of knots is used, but for a binary regression it is difficult to 
determine apriori for any data set whether the default number of knots is adequate and in 
our opinion it is safer to take the more conservative approach by setting K^ to 120 or 150. In 
that case the times required by Adaptfit and the ME estimators are not that different, while 
the ME estimator in the binary case appears computationally more robust. 
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4 Real Examples 



4.1 Probability of crested lark sighting in Portugal 

This section demonstrates how the proposed method can be used to model spatial data by 
modelling the probability of a crested lark sighting at various locations in Portugual. The data 
were obtained from Wood (2006). Each observation refers to one tetrad (2km by 2km square) 
and contains a variable indicating whether the crested lark was sighted in the tetrad or not 
together with the location of the tetrad. The location of a tetrad is identified by kilometers 
east and north of an origin. Portugal can be divided into 25100 tetrads for which there were 
observations on 6457 tetrads. This dataset was analysed by Wood (2006) who aggregated the 
data into 10km by 10km squares and fitted a binomial generalized additive model (GAM) 
using thin plate regression splines to the aggregated response. In their example the degree of 
smoothness was estimated using the un-biased risk estimator (UBRE) , which was then scaled- 
up by a factor of 1.4 to avoid overfitting. The factor by which the smoothing parameter 
is rescaled is chosen subjectively and affects the estimated probabilities substantially. In 
contrast, our method allows for the degree of smoothness to vary across the covariate space 
and the estimated probabilities are therefore spatially adaptive . 



Our model for the probability of a crested lark sighting is given by (2.1) and (2.2) with 
X = (east, north), where east and north mean kilometers east and north of an origin. The 
dependent variable is 1 if a crested lark is sighted in a tetrad and if it is not. We assume 
a maximum of four mixture components. 

We considered the choice of the number of mixture components in several ways. First, 
the posterior probabilities for 1 to 4 components are 0.0, 0.94, 0.04 and 0.01, suggesting a 
two component mixture. We also looked at the empirical receiver operating curves (ROC) 
for models with 1 to 4 components. See Fan, Upadhye and Worster (2006) for a description 
of ROC curves. In our case a ROC curve shows the trade off between classifying a tetrad as 
containing the crested lark when the tetrad does contain the crested lark versus classifying 
a tetrad as containing the crested lark when it does not. The larger the area under the 
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ROC curve the more effective is tlie model at classifying. Out of 6457 observations we 
randomly selected 5817 for model fitting and set aside 640 for model testing. We estimated 
the probability that a tetrad contains the crested lark using mixture models containing one 
to four components. For each mixture model we then used the estimated probabilities to 
classify the 640 observations set aside for model testing. Figure [7] plots the empirical ROC 
curves based on these 640 observations and shows the improvement in classification that is 
achieved by using a mixture of two splines over a single spline. Figure [7] also shows that there 
is not much improvement in classification in moving from a mixture of two to a mixture of 
three or four and hence supports the choice of a mixture of two splines as the preferred model. 
This finding is consistent with the posterior probabilities of the number of components. 

Figure |8] shows contour plots for the probability of a crested lark sighting for a single 
spline estimator and a mixture of splines estimators. This figure and the land use and 
population maps, reproduced in figure, |9] show why a mixture of at least 2 splines is necessary. 
Mainland Portugal is split by the river Tagus. The population and land use maps show that 
the population of Portugal is concentrated to the north of the river, in particular in the 
northwest where the land is given over to wine production. The interior of the north is dry 
and mountainous. To the south of the river the predominant land form is rolling hills and 
the area is much less densely populated. The cultivated areas in the south are primarly for 
cork production but there are large tracts of forested areas. 

Figure [sj^b) shows that in the southern part of Portugal the probability of sighting a 
crested lark varies considerably and that these variations can occur abruptly. These abrupt 
changes correspond to changes in the topography of southern Portugal. The areas of high 
probability correspond to forested/tree crop areas or major rivers. The areas of low proba- 
bility correspond to pasturable lands. In contrast the probability of a crested lark sighting in 
the northern part of Portugal has little variation; the high population density together with 
the mountainous interior means that the probability is of sighting is uniformly low. Thus the 
degree of smoothing required depends on the covariates east and north. Figure [s] (a) shows 
that a single spline estimator cannot simultaneously capture the abrupt changes in southern 
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Portugal and the smooth changes in the north. 



4.2 Probability of belonging to a union 

This example shows how our methodology can be extended higher dimensions by modeling the 
probability of union membership as a function of three continuous variables, years education, 
wage and age, and three dummy variables, south (l=live in southern region of USA), female 
(l=female) and married (l=married). The data consists of 534 observations on US workers 
and can be found in Berndt (1991) and at http://lib.stat.cmu.edu/datasets/CPS_85_Wagesl 



Ruppert, Wand and Carroll (2003) estimate the probability of union membership using a 
generalized additive model without interactions. We model the three dimensional surface of 
the continuous covariates and our results suggest that this more appropriate than an additive 



model. Our model for the probability of union membership is given by (2.1) with x ={years 



education, wage, age, south, female, married). However, we modify the regression function 



(2.2) to take into account that three of the covariates are dummy variables and should be 



excluded from the non-parametric component by writing 

gjr{x) = a'j.,z + fjr{x*) 

where x* = {years education, wage, age), wage is in US $/hr and age is in years. The 
dependent variable is 1 if the worker belongs to a union and otherwise. 



Our method chooses one component 100% of the time. Figures 10 (a) - (c) show the 
joint marginal effect of two covariates at the mean of the third one and setting the dummy 
variables to zero. These figures clearly show interactions among the continuous covariates. 



For example figure 10 (a) shows that for workers whose age is less than 40, the probability 
of union membership initially increases with wage, before reaching a peak at a wage of about 
$15 /hr and then declines. For older workers this peak occurs at much lower wages, somewhere 



between $5/hr and $10/hr before declining sharply. Figure 10 (b) shows two modes. For 
workers with an average wage, union membership peaks at 55 years and 8-10 years education. 
Interestingly union membership peaks again at 55 years and 18 years education, although 
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this peak may be due to boundary effects. Figure 10 (c) shows that for workers who did not 
finish high school (< 12 years education) the probability of belonging to union increases as 
wage increases. In contrast, for workers with some tertiary education(> 14 years education) 
the probability of belonging to a union is initially high and then decreases with increasing 
wage. 



Acknowledgment 

We thank the referee, the associate editor and the editor for suggestions that improved the 
content and quality of the paper. We would especially like to thank Dr Tatyana Krivobokova 
for helping with the Adaptfit package and for carrying some of the computations for us. Sally 
Wood, Robert Kohn and Remy Cottet were supported by an ARC grant. 



Appendix A: Sampling scheme 

We estimate the binary regression probabilities by their posterior means, with all unknown 
parameters and latent variables integrated out. To make it easier to simulate from the 
posterior distribution we introduce a number of latent variables that are generated during 



the simulation and turn (2.1 ) into a hierarchical model. The first is the number of components 
r at any point in the simulation. Given r, define the vector of multinomial random variables 
7j, = {7r.(xi), . . . ,^r{xn)} such that 7r(xi) identifies the component in the mixture that Wi 
belongs to. We assume that 

n 

Pr(7^|r, (5r-,x) = JJPr{7r(xi)|r, , with Pr{7r.(xi) = i|(5r} = vrj>(xi) and 

i=l 
n 

Pr(w|r,7^,gi^) = JJ Pr{u;i|7r(a;i) =j,gjr{xi)} 
1=1 

To estimate the component splines, we follow Albert and Chib (1993) and Wood and 
Kohn (1998) and introduce a second level of latent variables, Vijr, also conditional on r, such 
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that 



Vijr = ZiUjr + fjr{xi) + eijr for j = l,...,r and i = l,...,n, 

where eijr ^ 1). The latent variable Vjir and the indicator variable 7r(3^i) s-rs related to 
each other and the observation Wi by requiring that 



If 'Jrixi) 7^ j, then is unconstrained. 

The sampling scheme moves between models with differing numbers of components by 
using reversible jump MCMC. To implement a reversible jump step to go from a model with 
r components to a model with r' components it is necessary to have a proposal density in 
the r' component space. We form such proposals by first running separate MCMC samplers 
for each r = 1, . . . ,R component models. This sampling scheme is described below under the 
heading of 'Updating within a model.' 

We now describe the complete sampling scheme. First, r is initialized by drawing it from 
the prior Pr(r = j) = 1/R,j = l,...,r. Conditional on this value of r, we initialize /3^, 
ocrjTr, Sr by the posterior means of the iterates of the model with r components. 

1. Moving Between Models 

Let X*^ = (r*^, 6^) be the current value of the parameters in the chain, where 6^ = 
(a^,/3^,5^). We propose a new value of XP = {rP,Qr) and accept this proposal using 
a Metropolis-Hastings (M-H) step. The M-H probability of accepting such a proposal 
is 



where q{X^, X^) is an arbitrary transition probability function that moves the chain 



Vijr > if lUj = 1 and ^yrixi) = j 
Vijr < if = and "frixi) = j. 




(A.l) 



from X" to XP. 
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The proposal density q{X'^,X'P) is given by q{r'^ — r*')g(9^c — 6^p|r^). That is, a new 
value rP of r is proposed and 9^ is proposed conditional r^. The value of is proposed 
as follows: 

(a) If 1 < < i? then g(r^ -^rP = r''±l) = 0.5 

(b) If r'^ = 1 then g(r^ ^ = 2) = 1 

(c) If r'^ = then g(r'= ^ = - 1) = 1 

Then, conditional on r^, we propose new values of the parameter G^p = {S^p , P^p , ot^p) 
by doing the following. To simplify notation, we write as r. 

(a) Draw S^. from MFr5(5, ), where dr and are the sample mean and co- 
variance of the iterates sf^^ from the individual MCMC scheme for a mixture 
of r components. We use the notation MVT^{a,B) to denote a multivariate t 

distribution with 5 degrees of freedom, location vector a and scale matrix B. 

(b) Draw (3*/ = {/3P,a^) from MFTg^*,!:^.). where ^*and E^* are the sample 
mean and covariance of the iterates f3r from the individual MCMC scheme for 
a mixture of r components. 

The sampling scheme for a model of an r component mixture is identical to the scheme 
for a within model move (step 2 below) . 

2. Updating within Model 

Given the new values of r, 6r, and g^., the parameters specific to that model are updated 
as follows: 

(a) Draw 7^, simultaneously from P{'y,Vr\w,g^,Sr,Tr) by first drawing 7^ and 

then drawing V,- conditional on the value of 7,.. 

i. To draw 7^ from Pr(7^|t«, 5^, g^) note that 

n 

PT{'y^\w,Sr,gr) = '[]_'P^{lr{Xi)\Wi,Sr,gjr{Xi)}. 
1=1 
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If Wi = 1 then, 

PT{Wi = l|7r(a:i) = j,Sr,gjr{xi)}Pl[{'yr{xi) = j\dr} 



Fri'Jrixi) = j\Wi = l,dr,gr} 



Y^l=iPT^{'Wi = l|7r(a;i) = k,gkrixi),dk}PT:{'yr{xi) = k\Sr} 

^{gjr{Xi)}Pv{'yr{xi) = j\6r} 



El=lH9k{Xi)}Pr{Jr{x^) = k\Sr}' 

Similarly, if = then, 

[1 - <^{gjr{xi)}] Pr{jriXi) = j\Sr} 



Pr{7,.(j;i) = j\wi = 0, 5^,, gi,,} 



ELJl - HdkAxi)}] Pr{7,(xO = k\Sr}' 
ii. To draw Vr from p{Vr\w,-fj.,g.^.) note that, 

n r 

p{Vr\,W,^r,gr) =Y[Y[p{vijr\gjT{xi),Jr{xi),Wi). 

i=lj=l 

If 7r(xi) = j and Wi = 1, then Vijr ~ N(gjr{xi), 1), and is constrained to be 
positive. 

If "frixi) = j and Wi = 0, then Vijr ~ N{gjr{xi), 1), and is constrained to be 
negative. 

If ^r{xi) 7^ j then draw Vijr from its unconstrained distribution, which is 
N{gjr{xi), 1). 

Draw 6r from p(5r|7r) using a M-H step. The conditional posterior distribution 
of Sr is 



and 6r ~ -/V(0, 10/). Our proposal density is MVT5{6max,V5) where Smax is that 



value of Sr that maximizes p{Sr\w,y^) in (A. 2), and is the negative of the 
inverse of the second derivative of log [p{Sr\w,jj.)]- 

Draw j3,,.,OLr simultaneously from ctj.] Vj., r^) by first drawing a,, and then 
conditional on this value of cxr drawing /3^: 
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i. To draw note that 

r 

p{0Cr\Vr,Tr) = Y[p{ajr\Vj,Tjr) 

j=l 

and p{ajr\vj,Tjr) ^ N{M.a,Va) where, 

Va = [Z'Z - Z' XTjr[TjrX' X + I]'^ X' Z]'^ 

and 

Ma = Va [Z'vjr - Z' XTjr[TjrX' X + I\-^X'vjr\ 

ii. To draw (3^. note that 

r 

i=i 

and p(^l3jj-\Vjr, Tjr, Ctjr) ^ N{M/3,Vi3) where 

V^ = Tjr[TjrX'X + 1]-^ 

which is diagonal and 



= VfjX'v 



jr 



and Vjr = Vjr — Zajr- 
(d) Draw simultaneously from p(Tr|/3^) by drawing from 

r 

and then re- label 7^(2^1) for i = 1, . . . , n so that Tir,> . . . , > r^r (see Stephens, 
2000). 

Appendix B: Constructing the design matrix 

This appendix outlines how we construct the design matrix X to allow for a large number of 

observations and a moderate number of covariates. 
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1. Normalize the values of all covariates to lie in the interval [0,1] 

2. Choose the number m and location of knots, so that in a given hypercube of width 
e, (e is typically chosen to be 0.05) and dimension a knot is placed at the centre 
of gravity of the hypercube. If there are no points in a hypercube then a knot is not 
chosen for that hypercube. If the data are equally spaced across a grid where the grid 
length = e then this results in m = n. If the data are clustered, as is often the case in 
high dimensional data, then this technique results in m < n. 

3. Let Xj be the position of the j*^ knot and let be the i^^ row of the normalized 
covariates. Radial basis functions, denoted by ^jj, are constructed such that = 

— Xjll" *log(||a;* — Xjll), a = 2*ceil(| + 0.1) — p, where ceil(a;) means to round x up 
to the nearest integer value. This means that the element of the n x m design 

matrix X is equal to (pij for i = 1, . . . ,n and j = 1, . . . ,m. 

4. To limit the dimension of the design matrix we take a singular value decomposition 
of X, s.t. X = UAV' where U and V' are square, orthonormal matrices and A is an 
n X m matrix, with nonnegative numbers on the diagonals, Xu for i = 1, . . . ,m where 

All > • • • > Knm, and zeros off the diagonal. Wc then let Xu = for i > I' , where 
I' = 25. We choose I' in this way because typically 1 - Y.^=i >Si/ YT=i ^1 < 1 x 

5. We re-form X by letting X = UA. The design matrix X is now a n x I' matrix. Note 
that XX' = Uh?U' has the eigenvalue decomposition QDQ', so that the resulting n x I' 
design matrix could have been formed by performing an eigenvalue decomposition on 
the n X n matrix XX' = QDQ' and setting = for i > /', however if n is large 
performing an eigenvalue decomposition is computationally intractable. 
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Figure 1: The soUd hne gives the true function H{x) (Table [T] function (b)). The dotted hne 

(...) is the estimate H(x) for mixing on the inside while the dashed line ( ) is the estimate 

H(x) for mixing on the outside. 
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Figure 2: Boxplots of the ASKLD between an estimate based on a mixture of splines and an 
estimate based on a single spline for the functions in table [T| Note that if the ASKLD > 
then the estimator based on a mixture of splines is better than the estimator based on a 
single spline. 
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Figure 3: Panels (a)~(c) plot the estimates corresponding to the 10th worst percentile ASKLD 
for functions (a) - (c) in table [T| Panels (d)-(f ) and panels (g)-(i) are similar plots corre- 
sponding to the 50th percentile ASKLD and 10th best percentile ASKLD, respectively. In 
all cases n = 1000 and the true function H{x) is given by the dotted line, the estimate based 
on a mixture is given by the thick solid line and the estimate based on a single spline is given 
by the thin solid line. 
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Figure 4: Cylinder data, panel (a) plots the true function and data, panel (b) plots the 
estimate for a single spline and panel (c) plots the estimate for a mixture of splines. 
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Figure 5: Plots of the pointwise empirical coverage probabilities for the mixture of experts 
(ME) and adaptive fit (AF) estimators when the nominal coverage probability is 0.9. The 
plots are for the functions (a)~(c). 
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Figure 6: Plots of the pointwise empirical coverage probabilities for the mixture of experts 
(ME) and adaptive fit (AF) estimators when the nominal coverage probability is 0.9. The 
plots are for the function (d). 
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;ure 7: ROC for out of sample data (data) for different number of mixture components. 
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Figure 9: Map of Portuguese land use panel (a) and population density 
panel (b). Produced by the Central Intelligence Agency and downloaded at 
www.lib.utexas.edu/maps / portugal.html . 
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Figure 10: Plot of Pr (Union Member| wage, age) at the mean years education panel (a); 
Pr(Union Member | age, years education) at the mean wage panel (b); Pr(Union Member| 
wage, years education) at the mean age panel (c). In all plots the dummy variables set at 
zero. 
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